(C) 2017-2019 by Damir Cavar
This tutorial accompanies the discussion of Word Sense Disambiguation and various machine learning strategies in the textbook Machine Learning: The Art and Science of Algorithms that Make Sense of Data by Peter Flach.
It was developed as part of my course material for the courses Machine Learning and Advanced Natural Language Processing at Indiana University.
For a simple Bayesian implementation of a Word Sense Disambiguation algorithm we will use NLTK's WordNet module. We import it in the following way:
In [1]:
from nltk.corpus import wordnet
For a word that we want to disambiguate, we need to get all its synsets:
In [2]:
mySynsets = wordnet.synsets('bank')
print(mySynsets)
For each synset we need to get its definition and its examples to use them as a bag of words for comparison:
In [3]:
for s in mySynsets:
    print(s.name())
    text = " ".join( [s.definition()] + s.examples() )
    print(text, "\n", "-" * 20)
We will need to join a list of lists into one list, that is, we need to flatten a list of lists. To achieve this, we can use the following code:
In [4]:
import itertools
lOfl = [["this"], ["is","a"], ["test"]]
print(list(itertools.chain.from_iterable(lOfl)))
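An equivalent one-liner uses sum with an empty list as the start value, though it copies the accumulator repeatedly and is quadratic for long inputs, so itertools.chain is preferable:
In [ ]:
print(sum(lOfl, []))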
What we should do is tokenize and part-of-speech tag the text, that is, the definitions and the examples. We can use NLTK's word_tokenize and pos_tag functions:
In [5]:
from nltk import word_tokenize, pos_tag
Now we can tokenize and PoS-tag the texts:
In [6]:
from nltk.corpus import stopwords
stopw = stopwords.words("english")
for s in mySynsets:
    print(s.name())
    text = pos_tag(word_tokenize(s.definition()))
    text += list(itertools.chain.from_iterable([ pos_tag(word_tokenize(x)) for x in s.examples() ]))
    text2 = [ x for x in text if x[0] not in stopw ]
    print(text2, "\n", "-" * 20)
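Note that punctuation tokens pass the stopword filter. A minimal sketch of one way to drop them as well, keeping only tokens whose string is alphabetic (text3 is a hypothetical name; after the loop, text2 holds the list for the last synset):
In [ ]:
text3 = [ x for x in text2 if x[0].isalpha() ]
print(text3)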
To match inflected word forms against the WordNet bags of words, we lemmatize them using NLTK's WordNetLemmatizer:
In [7]:
from nltk.stem import WordNetLemmatizer
wordnet_lemmatizer = WordNetLemmatizer()
wordnet_lemmatizer.lemmatize('dogs')
Out[7]:
'dog'
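Note that the WordNetLemmatizer assumes the noun part of speech by default; to lemmatize a verb form we have to pass pos='v'. A quick check:
In [ ]:
print(wordnet_lemmatizer.lemmatize('barking'))           # noun default: 'barking'
print(wordnet_lemmatizer.lemmatize('barking', pos='v'))  # verb: 'bark'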
Given a text that contains the word we want to disambiguate, the first step is to find the position of that word in the token list:
In [8]:
example = "John saw the dogs barking at the cats."
keyword = "dog"
tokens = word_tokenize(example)
lemmas = [ wordnet_lemmatizer.lemmatize(x) for x in tokens ]
pos = -1
try:
    pos = lemmas.index(keyword)
except ValueError:
    pass
print("Position:", pos)
print(lemmas)
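list.index only returns the first match; if the keyword occurs more than once, a small sketch to collect every position (positions is a hypothetical name, not used below):
In [ ]:
positions = [ i for i, l in enumerate(lemmas) if l == keyword ]
print("Positions:", positions)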
We PoS-tag the tokens and inspect the tag of the keyword; the first letter of its Penn Treebank tag encodes the major word category:
In [9]:
posTokens = pos_tag(tokens)
print("Lemma:", lemmas[pos])
print(" PoS:", posTokens[pos])
print(" Tag:", posTokens[pos][1])
print(" MTag:", posTokens[pos][1][0])
In [10]:
category = posTokens[pos][1][0]
print(category)
This major category letter can be mapped to the corresponding WordNet part-of-speech type:
In [11]:
wType = None
if category == 'N':
    wType = wordnet.NOUN
elif category == 'V':
    wType = wordnet.VERB
elif category == 'J':
    wType = wordnet.ADJ
elif category == 'R':
    wType = wordnet.ADV
print("Type:", wType)
Restricting the part of speech, we now fetch only the synsets of the matching type:
In [12]:
wordnet.synsets(keyword, pos=wType)
Out[12]:
For each of these candidate synsets we build the tagged bag of words from the definition and examples, just as before:
In [13]:
for s in wordnet.synsets(keyword, pos=wType):
    print(s.name())
    text = pos_tag(word_tokenize(s.definition()))
    text += list(itertools.chain.from_iterable([ pos_tag(word_tokenize(x)) for x in s.examples() ]))
    print(text, "\n", "-" * 20)
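The notebook stops short of the final comparison. Here is a minimal sketch of one way to finish it, closer to the simplified Lesk algorithm than to the Bayesian scoring the textbook discusses: lemmatize each synset's bag of words, intersect it with the lemmatized sentence context, and pick the synset with the largest overlap. All new names here (context, bagOfWords, bestSense, and so on) are hypothetical:
In [ ]:
# context: the lemmatized, stopword- and punctuation-filtered sentence tokens
context = set(x for x in lemmas if x.isalpha()) - set(stopw)
bestSense, bestOverlap = None, -1
for s in wordnet.synsets(keyword, pos=wType):
    # bag of words from the synset's definition and examples, lemmatized
    senseTokens = word_tokenize(" ".join([s.definition()] + s.examples()))
    bagOfWords = set(wordnet_lemmatizer.lemmatize(x) for x in senseTokens) - set(stopw)
    overlap = len(context & bagOfWords)
    print(s.name(), overlap)
    if overlap > bestOverlap:
        bestSense, bestOverlap = s, overlap
if bestSense:
    print("Best sense:", bestSense.name(), "-", bestSense.definition())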